Abstract - Given a set of k networks, possibly with different
sizes and no overlaps in nodes or edges, how can we
quickly assess similarity between them, without solving the node- 
correspondence problem? Analogously, how can we extract a 
small number of descriptive, numerical features from each graph 
that effectively serve as the graph's "signature"? Having such 
features will enable a wealth of graph mining tasks, including 
clustering, outlier detection, visualization, etc. 

We propose NetSimile - a novel, effective, and scalable 
method for solving the aforementioned problem. NetSimile has 
the following desirable properties: (a) It gives similarity scores 
that are size-invariant, (b) It is scalable, being linear on the 
number of edges for "signature" vector extraction, (c) It does 
not need to solve the node-correspondence problem. We present 
extensive experiments on numerous synthetic and real graphs 
from disparate domains, and show NetSimile's superiority over 
baseline competitors. We also show how NetSimile enables sev- 
eral mining tasks such as clustering, visualization, discontinuity 
detection, network transfer learning, and re-identification across 
networks. 



I. Introduction 

We address the problem of network similarity. Specifically, 
given a set of networks of (possibly) different sizes, and 
without knowing the node-correspondences, how can we effi- 
ciently provide a meaningful measure of structural similarity 
(or distance)? For example, how structurally similar are the 
SDM and SIGKDD co-authorship graphs? How does their 
structural similarity compare with the similarity between the 
SDM and ICDM co-authorship graphs? Such measures are 
extremely useful for numerous graph-mining tasks. One such 
task is clustering: given a set of graphs, find groups of 
similar ones; conversely, find anomalies or discontinuities - 
i.e., graphs that stand out from the rest. Transfer learning is
another application: if graphs G1 and G2 are similar, we can
transfer conclusions from one to the other to perform across-
network classification with better classification accuracy. Yet
another application is re-identification across two graphs - 
where if the two graphs are similar, we can use information 
in one to re-identify nodes in the other. 

We define the network similarity / distance problem as 
follows. Input: A set of k anonymized networks of potentially 
different sizes, which may have no overlapping nodes or edges. 
Output: The structural similarity (or distance) scores of any 
pair of the given networks (or better yet, a feature vector for
each network).

The core of our approach, NetSimile, is a careful extrac- 
tion and evaluation of structural features. For every graph G, 
we derive a small number of numerical features, which capture 
the topology of the graph as moments of distributions over its 
structural features. The similarity score between two graphs 
then is just the similarity of their "signature" feature vectors. 
Once we have the similarity function, we can do a wealth 
of data mining tasks, including clustering, visualization, and 
anomaly detection. 

Our empirical study includes experiments on more than
30 real-world networks and various synthetic networks generated
by four different graph generators (namely, Erdos-Renyi,
Forest Fire, Watts-Strogatz, and Barabasi Preferential Attachment).
We compare NetSimile with two baselines. The first
baseline extracts frequent subgraphs from the given graphs and 
performs pairwise comparison on the intersection of the two 
sets of frequent patterns. The second baseline computes the 
k largest eigenvalues of each network's adjacency matrix and 
measures the distance between them. 

Our experiments provide answers to the following ques- 
tions: How do the various methods compare w.r.t. their sim- 
ilarity scores? Are their results intuitive (e.g., is a social 
network more similar to another social network than to a 
technological network)? How do they compare to null models? 
Are the methods just measuring the sizes of the networks in 
their comparisons? How scalable are the various methods? 
Can we build a useful taxonomy for networks based on their 
similarities? 

Proof of Concept. When measuring similarity, having a 
representative set of features for each graph is very pow- 
erful. The simplest way to illustrate this power is through 
visualization. Given a set of k graphs (which we describe
in Section III-A and which include co-authorship networks,
autonomous systems networks, Erdos-Renyi graphs, etc.), we
can use NetSimile to extract a graph x feature matrix,
and then project this matrix into its principal component space
through Singular Value Decomposition (SVD).

Figure 1 depicts the scatterplots of the NetSimile graph x
feature matrix's 1st vs. 2nd principal components, 3rd vs. 4th
principal components, and 5th vs. 6th principal components.

1 Throughout the paper, we assume similarity and distance are interchangeable.
Fig. 1. An illustration of the visualization power of NetSimile's features. Panels: (a) 1st vs. 2nd, (b) 3rd vs. 4th, and (c) 5th vs. 6th principal components of NetSimile features. After doing SVD on the aggregated-feature matrix W [k graphs x fr aggregated-features] (where f = # features and r = # aggregators), we get k points in low-dimensional space, where similar graphs naturally cluster. (a) The query-log graphs (in red) fall on a line. (b) The Erdos-Renyi graphs (brown) fall on the vertical line. (c) The "autonomous systems" graphs (in green) group together.



As the plot clearly shows, NetSimile's graph "signature" 
vectors are discriminative enough to separate out the various 
types of graphs and to find anomalous graphs among the set 
of input graphs. For example, in Figure 1(a), the synthetic
"Barabasi-100K-nodes" stands out, while a group of "query-log"
graphs align (red circles). In Figure 1(b), the Erdos-Renyi
graphs (in brown) occupy the vertical axis, while in Figure 1(c)
the green points ("Oregon AS" graphs) cluster together.
a) Contributions: 

• Novelty: By using moments of distribution as aggregators, 
NetSimile generates a single "signature" vector for each 
graph based on the local and neighborhood features of its 
nodes. 

• Effectiveness: NetSimile produces similarity / distance 
measures that are size-independent, intuitive, and inter- 
pretable. 

• Scalability: The runtime complexity for generating Net- 
Simile's "signature" vectors is linear on the number of 
edges. 

• Applicability: NetSimile's "signature" vectors are use- 
ful in many graph mining tasks. 

The rest of the paper is organized into the following 
sections: Proposed Method, Experiments, Related Work, and 
Conclusions. 

II. Proposed Method 

Algorithm 1 outlines NetSimile, which has three steps:
feature extraction (Algorithm 2), feature aggregation (Algorithm 3),
and comparison (Algorithm 4).

Feature extraction. NetSimile's feature extractor (see 
Algorithm 2) generates a set of structural features for each
node based on its local and egonet-based features - a node's 
egonet is the induced subgraph of its neighboring nodes. 
Specifically, NetSimile computes the following 7 features. 

• $d_i = |N(i)|$: number of neighbors (i.e., degree) of node
$i$; $N(i)$ denotes the neighbors of node $i$.

• $c_i$: clustering coefficient of node $i$, defined as the number
of triangles connected to node $i$ over the number of
connected triples centered on node $i$.



Algorithm 1 NetSimile

Require: {G_1, G_2, ..., G_k}, doClustering
// extract features from nodes
{F_{G_1}, F_{G_2}, ..., F_{G_k}} := GetFeatures({G_1, G_2, ..., G_k})
// generate "signature" vectors for each graph
{s_{G_1}, s_{G_2}, ..., s_{G_k}} := Aggregator({F_{G_1}, F_{G_2}, ..., F_{G_k}})
// do comparison and return similarity/distance values or clusterings for the given graphs
return Compare({s_{G_1}, s_{G_2}, ..., s_{G_k}}, doClustering)



Algorithm 2 NetSimile's GetFeatures

Require: {G_1, G_2, ..., G_k}
for all G_j in {G_1, G_2, ..., G_k} do
    F_{G_j} := [] // initialize feature matrix for G_j
    // extract features for all nodes in G_j
    for all i in V_j do
        F_{G_j} := F_{G_j} U {{d_i, c_i, \bar{d}_{N(i)}, \bar{c}_{N(i)}, |E_{ego(i)}|, |E^o_{ego(i)}|, |N(ego(i))|}}
    end for
end for
// return a set of node x feature matrices
return {F_{G_1}, F_{G_2}, ..., F_{G_k}}



• $\bar{d}_{N(i)}$: average number of node $i$'s two-hop away neighbors,
computed as $\frac{1}{d_i} \sum_{j \in N(i)} d_j$.

• $\bar{c}_{N(i)}$: average clustering coefficient of $N(i)$, calculated
as $\frac{1}{d_i} \sum_{j \in N(i)} c_j$.

• $|E_{ego(i)}|$: number of edges in node $i$'s egonet; $ego(i)$
returns node $i$'s egonet.

• $|E^{o}_{ego(i)}|$: number of outgoing edges from $ego(i)$.

• $|N(ego(i))|$: number of neighbors of $ego(i)$.
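
As an illustration of the seven features listed above, the following is a minimal pure-Python sketch, not the authors' C++/Matlab implementation. It assumes an undirected graph given as a dictionary of neighbor sets, and it assumes that the egonet of node i contains i itself together with its neighbors; the function name `node_features` and the treatment of degree-0 and degree-1 nodes (features set to 0) are choices made here for illustration only.

```python
from itertools import combinations


def node_features(adj):
    """Return {node: [d, c, avg_nbr_deg, avg_nbr_clust, E_ego, E_out, N_ego]}.

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    Assumption: ego(i) is node i plus its neighbors.
    """
    deg = {i: len(nbrs) for i, nbrs in adj.items()}

    # Clustering coefficient: links among neighbors over possible neighbor pairs.
    clust = {}
    for i, nbrs in adj.items():
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        pairs = deg[i] * (deg[i] - 1) / 2
        clust[i] = links / pairs if pairs else 0.0

    feats = {}
    for i, nbrs in adj.items():
        d = deg[i]
        ego = nbrs | {i}
        avg_deg = sum(deg[j] for j in nbrs) / d if d else 0.0
        avg_clust = sum(clust[j] for j in nbrs) / d if d else 0.0
        # Each internal edge is seen from both endpoints, hence the halving.
        internal = sum(1 for u in ego for v in adj[u] if v in ego) // 2
        # Edges with exactly one endpoint inside the egonet.
        outgoing = sum(1 for u in ego for v in adj[u] if v not in ego)
        # Distinct nodes outside the egonet that touch it.
        ego_nbrs = len({v for u in ego for v in adj[u]} - ego)
        feats[i] = [d, clust[i], avg_deg, avg_clust, internal, outgoing, ego_nbrs]
    return feats


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
features = node_features(example)
```

The per-node cost here is dominated by the neighbor-pair loop, so this sketch is meant for small graphs; it does not attempt to reproduce the complexity bounds discussed later.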

Note that NetSimile is flexible enough to incorporate 
additional features. We choose these seven local and egonet- 
based features because they satisfy our constraints in terms 
of effectiveness (namely, size-independence, intuitiveness, and 
interpretability) and scalability (see Section III). Also, empirically,
we observed the aforementioned features to be sufficient
for measuring similarity across graphs from various domains
(details in Section III).

Feature aggregation. After the feature extraction step, 
NetSimile has extracted a node x feature matrix, $F_{G_j}$, for



Algorithm 3 NetSimile's Aggregator

Require: {F_{G_1}, F_{G_2}, ..., F_{G_k}}
for all F_{G_j} in {F_{G_1}, F_{G_2}, ..., F_{G_k}} do
    s_{G_j} := [] // initialize "signature" vector for G_j
    // for each feature column in F_{G_j}, compute a set of aggregates
    for all feat in F_{G_j} do
        s_{G_j} := s_{G_j} U {median(feat), mean(feat), stdev(feat), skewness(feat), kurtosis(feat)}
    end for
end for
// return a set of "signature" vectors for the graphs
return {s_{G_1}, s_{G_2}, ..., s_{G_k}}



Algorithm 4 NetSimile's Compare

Require: {s_{G_1}, s_{G_2}, ..., s_{G_k}}, doClustering
if doClustering then
    // return clusterings of the given graphs
    return Cluster({s_{G_1}, s_{G_2}, ..., s_{G_k}})
else
    // return similarity/distance values of the graphs
    return PairwiseCompare({s_{G_1}, s_{G_2}, ..., s_{G_k}})
end if



each graph $G_j \in \{G_1, G_2, \ldots, G_k\}$. We can measure similarity
between graphs by comparing their feature matrices (see
discussion below). However, we discovered that generating a
single "signature" vector for each graph produces more efficient
and effective comparisons. To this end, NetSimile uses
the following five aggregators on each feature (i.e., on each
column of $F_{G_j}$): median, mean, standard deviation, skewness,
and kurtosis. Note that the latter four of the five aggregators are
moments of distribution of each feature. NetSimile is flexible
enough to use other aggregators as well, though we found
these five to be sufficient for the task of network comparison
and satisfy our effectiveness and scalability constraints (see
Section III).
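
The aggregation step can be sketched as follows. This is an illustrative pure-Python version, not the authors' code: the names `aggregate` and `signature` are hypothetical, and the sketch assumes population (biased) moments and non-excess kurtosis, neither of which is specified in the text. Constant columns are given skewness and kurtosis 0 by convention here.

```python
import math
import statistics


def aggregate(column):
    """Return [median, mean, stdev, skewness, kurtosis] of one feature column."""
    n = len(column)
    mean = sum(column) / n
    # Central moments (population form; an assumption of this sketch).
    m2 = sum((x - mean) ** 2 for x in column) / n
    m3 = sum((x - mean) ** 3 for x in column) / n
    m4 = sum((x - mean) ** 4 for x in column) / n
    std = math.sqrt(m2)
    skew = m3 / std ** 3 if std else 0.0
    kurt = m4 / m2 ** 2 if m2 else 0.0
    return [statistics.median(column), mean, std, skew, kurt]


def signature(feature_rows):
    """Flatten per-column aggregates of a node x feature matrix into one vector."""
    sig = []
    for column in zip(*feature_rows):
        sig.extend(aggregate(list(column)))
    return sig


# Two nodes with seven features each give a vector of 7 * 5 = 35 values.
rows = [[2, 1.0, 2.5, 0.7, 3, 1, 1], [1, 0.0, 3.0, 0.3, 1, 2, 2]]
sig = signature(rows)
```

With f = 7 features and r = 5 aggregators, every graph is reduced to a vector of length fr = 35 regardless of its number of nodes, which is what makes signatures of differently sized graphs directly comparable.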

Comparison. After the feature aggregation step, NetSimile
has produced a "signature" vector for every graph
$G_j \in \{G_1, G_2, \ldots, G_k\}$. NetSimile now has the whole arsenal
of clustering techniques and pairwise similarity / distance
functions at its disposal. Amongst the collection of pairwise
similarity / distance functions, we found Canberra Distance
($d_{Can}(P, Q) = \sum_{i=1}^{d} \frac{|P_i - Q_i|}{|P_i| + |Q_i|}$) to be very discriminative
(a good property for a distance measure). This is because
Canberra Distance is sensitive to small changes near zero;
and it normalizes the absolute difference of the individual
comparisons (see discussion in Section III).
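
A small sketch of the Canberra Distance between two signature vectors is given below. It uses the standard definition with absolute values in the denominator and, as an assumption of this sketch, treats a coordinate where both entries are zero as contributing 0 to the sum.

```python
def canberra(p, q):
    """Canberra distance: sum of |p_i - q_i| / (|p_i| + |q_i|) over coordinates."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(p, q):
        denom = abs(a) + abs(b)
        if denom:  # a 0/0 term is counted as 0 (assumption of this sketch)
            total += abs(a - b) / denom
    return total


# The same absolute gap weighs more when the values are near zero.
near_zero = canberra([0.001], [0.002])   # relative gap is large
far_from_zero = canberra([100.001], [100.002])  # relative gap is tiny
```

Because every term is bounded by 1, the distance between two d-dimensional vectors lies in [0, d], and a difference of 0.001 between two small values counts far more than the same difference between two large ones.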

Computational complexity. Let $k$ = number of graphs
given to NetSimile (i.e., $k = |\{G_1, \ldots, G_k\}|$), $n_j$ = the
number of nodes in $G_j$, $m_j$ = the number of edges in $G_j$, $f$
= number of structural features extracted, and $r$ = number of
aggregators used.

Lemma 1: The runtime complexity for generating NetSimile's
"signature" vectors is linear on the number of edges
in $\{G_1, \ldots, G_k\}$, and specifically

$$O\Big(\sum_{j=1}^{k} \big(f n_j + f n_j \log(n_j)\big)\Big) \qquad (1)$$

where $f \ll n_j \ll m_j$ and $n_j \log(n_j) \approx m_j$ in real-world
graphs.

2 Note that f, r, and k are in the 10s.

Proof: To generate NetSimile's "signature" vectors,
features need to be extracted and then aggregated.

Feature Extraction: Recall that NetSimile is computing
local and neighborhood-based structural features. As proved
in [1], computation of neighborhood-based features is expected
to take $O(n_j)$ for real-world graphs. Therefore, to compute $f$
neighborhood-based features on a graph $G_j$, it takes $O(f n_j)$.

Feature Aggregation: This is $O(f n_j \log(n_j))$ for each graph
$G_j$. Recall that NetSimile's aggregators are median, mean,
standard deviation, skewness, and kurtosis. The latter four
can be computed in one pass through the feature values.
The most expensive computation is the median, which
cannot be done in one pass. However, it can be computed
in $O(n \log(n) + n)$ for $n$ numbers. Basically, one needs
$O(n \log(n))$ to sort the $n$ numbers. Then, a selection algorithm
can be used to get the median with only $O(n)$ operations. ■
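
To make the one-pass claim in the proof concrete, the sketch below accumulates the power sums of a feature column in a single loop and derives mean, standard deviation, skewness, and kurtosis from them, while the median is obtained separately by sorting. The function names are hypothetical, population moments are assumed, and the raw-moment formulas used here can lose precision on data with a large mean relative to its spread.

```python
import math


def one_pass_moments(values):
    """Return (mean, stdev, skewness, kurtosis) from a single pass over values."""
    n = 0
    s1 = s2 = s3 = s4 = 0.0
    for x in values:  # the only pass over the data
        n += 1
        s1 += x
        s2 += x * x
        s3 += x ** 3
        s4 += x ** 4
    mean = s1 / n
    # Central moments expressed through raw moments.
    m2 = s2 / n - mean ** 2
    m3 = s3 / n - 3 * mean * s2 / n + 2 * mean ** 3
    m4 = s4 / n - 4 * mean * s3 / n + 6 * mean ** 2 * s2 / n - 3 * mean ** 4
    std = math.sqrt(max(m2, 0.0))
    skew = m3 / std ** 3 if std else 0.0
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0
    return mean, std, skew, kurt


def median_by_sorting(values):
    """O(n log n) median: sort, then pick the middle element(s)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


moments = one_pass_moments([1, 2, 3, 4, 5])
middle = median_by_sorting([5, 1, 4, 2, 3])
```

The sort inside `median_by_sorting` is what contributes the $n_j \log(n_j)$ term in Equation (1); the four moments add only linear work per feature.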

Remark: Network comparison through statistical hypothesis
testing. Given the node x feature matrices of two
graphs, $F_{G_1}$ and $F_{G_2}$, NetSimile can use statistical hypothesis
testing to see if the two graphs are samples from the same
underlying distribution. Specifically, NetSimile normalizes
each column (i.e., feature) in $F_{G_1}$ and $F_{G_2}$ by its L2 norm.
Then, NetSimile does pairwise hypothesis testing across
the features of the graphs. For example, it does hypothesis
testing between the degree columns in $G_1$ and $G_2$; between the
clustering coefficient columns in $G_1$ and $G_2$; and so on. This
process produces seven p-values (corresponding to the seven 
features extracted by NetSimile). To decide whether the two 
graphs are from the same underlying distribution, NetSimile 
uses the maximum p-value. We also tried the average of the p- 
values, though that analysis did not produce as discriminative 
results as the maximum p-value. 

For the statistical hypothesis tests, NetSimile can use
any test available. We tried the Mann-Whitney Test [2] and
the Kolmogorov-Smirnov Test [3]. The Mann-Whitney Test
is nonparametric. It assumes two samples are independent
and measures whether the two samples of observations have
equally large values. The Kolmogorov-Smirnov Test is also
nonparametric. We used the two-sample Kolmogorov-Smirnov
Test, which compares two samples w.r.t. the location and shape
of the empirical cumulative distribution functions of the two
samples. We found that neither test generated enough
discriminative power to effectively capture differences between
graphs (though the Mann-Whitney Test was more discriminative).
See Section III for details.
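
For readers unfamiliar with the two-sample Kolmogorov-Smirnov Test, the sketch below computes only its test statistic, the largest gap between the two empirical cumulative distribution functions, after the L2 normalization of a feature column described above. It is a pure-Python illustration with hypothetical function names; it does not compute p-values, which in practice would come from a statistics library.

```python
import math
from bisect import bisect_right


def l2_normalize(column):
    """Divide a feature column by its L2 norm (left unchanged if the norm is 0)."""
    norm = math.sqrt(sum(x * x for x in column))
    return [x / norm for x in column] if norm else list(column)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all data points."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        # bisect_right counts how many observations are <= x.
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Degree columns of two hypothetical graphs, normalized and then compared.
degrees_g1 = l2_normalize([1, 2, 2, 3, 5])
degrees_g2 = l2_normalize([1, 1, 2, 4, 9, 12])
gap = ks_statistic(degrees_g1, degrees_g2)
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all; repeating this for each of the seven feature columns gives the per-feature comparisons that the remark describes.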

Remark: Network comparison at the local- vs. global-
level. Whether one prefers local-level network similarity to
global-level network similarity depends on the application for
which the similarity is being used. NetSimile is designed
such that it can take either local-level or global-level features.
Here, we emphasize NetSimile's local-level network similarity.
The advantage of local-level comparison is that node-
level and egonet-level features are often more interpretable
than global features - e.g., consider average degree of a node
vs. the number of distinct eigenvalues of the adjacency matrix.
Also, local-level features are computationally less expensive
than global-level features - e.g., consider clustering coefficient
of a node vs. diameter of the graph. Moreover, looking at
local-level features answers the question: "are the given two
networks from similar linking models?" For example, consider
the Facebook and Google+ social networks. Even though
Google+ is a smaller network than Facebook, are its users
linking in a similar way to the users of the Facebook network?
In other words, is the smaller Google+ network following
a similar underlying model as the larger Facebook network?
Local-level features can capture any similarity present in the
linking models of the two networks, but global-level features
cannot.

3 We informally define discriminative power to be the power to make fine
distinctions. More details in Section III.

III. Experiments 

This section is organized as follows. First, we outline 
the real and synthetic datasets used in our experiments, as 
well as our experimental setup. Second, we describe two 
baseline methods "FSM" (Frequent Subgraph Mining) and 
"EIG" (Eigenvalues Extraction). Third, we present results 
that answer the following questions: How do the different 
approaches compare? Is there a particular method which 
clearly outperforms the others? If yes, to what extent? How
can we interpret the results? Can we build a taxonomy over the
networks based on our results? Is NetSimile affected by the
sizes of the networks? How do the proposed methods scale?
How well does NetSimile perform in various graph mining
applications?

A. Data and Experimental Setup 

Real Networks. Table I lists the basic statistics of the real
networks used in our experiments. Here is a short description 
of each network. 

• arXiv (http://arxiv.org): Five different co-
authorship networks corresponding to the following 
fields: Astro Physics, Condensed Matter, General Rel- 
ativity, High Energy Physics and High Energy Physics 
Theory. 

• DBLP-C (http://dblp.uni-trier.de): Six different
co-authorship networks from VLDB, SIGKDD,
CIKM, ICDM, SIGMOD and WWW conferences, each 
spanning over 5 years (2005-2009). 

• DBLP-Y: Five different co-authorship networks, each 
corresponding to one of the years from 2005 to 2009 
and consisting of data from 31 conferences. 

• IMDb (http://www.imdb.com): Five collaboration
networks for movies issued from 2005 to 2009. Each node
represents a person who took part in the movie (i.e., cast
and crew). Edges connect people who collaborated on a
movie.

TABLE I

Real networks: #nodes, #edges, avg degree, network
clustering coefficient (transitivity), avg node clustering
coefficient, #nodes in the largest connected component,
#connected components.

• QueryLog (http://www.gregsadetsky.com/aol-data):
Five word co-occurrence networks built
from a query-log of approximately 20 million web-search
queries submitted by 650,000 users over 3
months.

• Oregon AS (http://snap.stanford.edu/data/):
Five Autonomous Systems (AS) routing graphs
between March 31st and May 26th 2001.

Synthetic Networks. Apart from the real networks, we
also produced several synthetic networks by using the following
generators from the igraph library
(http://igraph.sourceforge.net):

• Barabasi-Albert [4]: With a non-assortative version of
the generator, we created graphs with 1K, 10K, and 100K
nodes, adding 4 edges in each iteration.

• Forest-Fire [5]: We generated graphs of size 1K, 10K,
and 100K nodes, with 20% forward burning probability,
40% backward burning probability, and 4 ambassador
vertices.

• Erdos-Renyi [6]: We used the G(n, m) generator, where
n is the number of nodes and m the number of edges,
and produced graphs G(n, 2n) with 1K, 10K, and 100K
nodes.

• Watts-Strogatz [7]: We built graphs of size 200, 2K,
and 20K nodes by setting the lattice dimension to 1, the
degree to 4, and the rewiring probability to 0.3.
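
To illustrate what a G(n, m) generator produces, here is a minimal pure-Python sketch that draws m distinct undirected edges uniformly at random among n nodes. It is not the igraph generator used in the experiments; the function name `gnm_random_graph` and the `seed` parameter are hypothetical and introduced only for this example.

```python
import random


def gnm_random_graph(n, m, seed=None):
    """Return a set of m distinct undirected edges (u, v) with u < v on n nodes."""
    if m > n * (n - 1) // 2:
        raise ValueError("m exceeds the number of possible edges")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)  # two distinct endpoints, so no self-loops
        edges.add((min(u, v), max(u, v)))  # canonical order avoids duplicate edges
    return edges


# G(n, 2n) with 1K nodes, mirroring the smallest Erdos-Renyi setting above.
n = 1000
edges = gnm_random_graph(n, 2 * n, seed=7)
average_degree = 2 * len(edges) / n
```

For any G(n, 2n) graph the average degree is 2m/n = 4, independent of n, so the three Erdos-Renyi sizes differ in scale but not in density.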

For each generator and for each node-set size, we built five 
networks. Our results report the average values obtained across 
the five networks per generator and node-set size. 

Experimental Setup. We implemented our approach in 
C++ and Matlab, making use of the GNU Statistic Libraries 
and igraph. The code was run on a server equipped with 8 
Intel Xeon processors at 3.0GHz, with 16GB of RAM, and 
running CentOS 5.2 Linux. 

B. Baseline Methods 

We compare NetSimile with (a) Frequent Subgraph Mining
and (b) Eigenvalues Extraction. We chose these two methods
because they are intuitive and widely applicable. Many
methods discussed in Section IV are application-dependent.

FSM (Frequent Subgraph Mining): Given two graphs, 
we take the intersection of their frequent pattern-sets and
build two vectors (one per graph) of relative supports of their 
patterns [8]. We compare these FSM vectors with NetSim- 
ile's "signature" vectors using Cosine Similarity and Canberra 
Distance. A clear drawback of FSM is its lack of scalability 
(since it relates to subgraph isomorphism). 

EIG (Eigenvalues Extraction): This is an intuitive measure 
of network similarity that is based on global feature extraction 
(as opposed to the local feature extraction of NetSimile). 
For each graph, we compute the k largest eigenvalues of its
adjacency matrix, and thus we obtain a vector of size k per 
graph. Then, we use the Canberra Distance in order to compare 
these vectors and find the pairwise similarities between the 
graphs. A disadvantage of EIG is that it is size dependent: 
larger networks - or ones with larger LCC (Largest Connected 
Component) - have higher eigenvalues. Thus, EIG will lead 
to higher similarity between networks with comparable sizes. 
Moreover, there is no global upper-bound for eigenvalues, 
making distance values hard to compare. 
TABLE II

Properties of NetSimile and baselines 



C. Comparative Results 

Table II summarizes the basic properties of NetSimile,
FSM, and EIG. NetSimile is the only one having the desired
properties of being scalable, size-independent, intuitive, and
interpretable. Here, being intuitive means providing results that
are consistent with the background knowledge that we have
about the data. For example, intuitively, a co-authorship network
should be more similar to another co-authorship network
than to a technological network like autonomous systems.
Moreover, interpretability is given by the low complexity of
the theory behind the concepts that build NetSimile. In fact,
the signature vectors returned by NetSimile are comparing
the moments of distributions of local (i.e., neighborhood-
based) features of the two graphs, and these are commonly
understandable measures. Empirical support follows next.

4 We tried a few values for k and saw no significant changes around 10; so
we selected k = 10.

For each method (NetSimile, FSM, and EIG), after
extracting features from the graphs and obtaining one (aggregated)
feature vector per graph, we apply the Canberra
Distance. We also report results of the Mann-Whitney U test
on NetSimile's local feature distributions. This statistical test
is unsuitable for EIG and non-trivial for FSM.

Figure 2 depicts the results of a set of experiments involving
the Canberra Distance on the DBLP-C datasets. The columns
in Figure 2 correspond to the heatmaps we obtained from
NetSimile, FSM and EIG, respectively. The first row reports
the results from the Canberra Distance. The second row reports
the results from the scaled Canberra Distance, where each
value is in [0,1].
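
The text does not say how the Canberra Distance is scaled into [0,1]. As one plausible reading, the sketch below builds the pairwise distance matrix over a set of signature vectors and divides every entry by the largest pairwise distance. Both the scaling rule and the function names are assumptions of this example, not the authors' stated procedure.

```python
def canberra(p, q):
    """Canberra distance; coordinates where both entries are 0 contribute 0."""
    total = 0.0
    for a, b in zip(p, q):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total


def scaled_distance_matrix(signatures):
    """Pairwise Canberra distances divided by their maximum (assumed scaling)."""
    k = len(signatures)
    dist = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            dist[a][b] = dist[b][a] = canberra(signatures[a], signatures[b])
    largest = max(max(row) for row in dist)
    if largest == 0:
        return dist  # all signatures identical; nothing to scale
    return [[value / largest for value in row] for row in dist]


# Three toy signature vectors; the first two are identical.
matrix = scaled_distance_matrix([[1, 2], [1, 2], [3, 1]])
```

Under this assumed scaling the most dissimilar pair in the set always has distance 1, so scaled values are comparable across methods but depend on which networks are included in the set.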

Inspecting Figures 2(a)-(b), we observe that the results of
NetSimile are similar to FSM. For instance, according to
both, the d-vldb network is similar to d-sigmod, which is not
so similar to d-sdm. However, the discriminative power of
NetSimile becomes evident when we inspect Figures 2(d)-(e).
In these figures, the Canberra Distance is scaled, so the
results can be compared on an equal footing. We observe that
the scaled values of FSM are much less discriminative.

Reexamining Figure 2, we also observe that the results
from EIG differ from the ones from NetSimile and FSM.
According to EIG, d-vldb has no significant differences with
d-sigmod, d-cikm, and d-sigkdd; while NetSimile and FSM
found differences. Moreover, there is no global normalization
for the EIG values. Thus, global comparisons of a set of networks
are harder to interpret with EIG than with NetSimile
and FSM.

As another point of comparison, we measure the entropy in
feature vectors generated by NetSimile, FSM, and EIG on
the DBLP-C co-authorship networks. As Figure 3 shows,
NetSimile's feature vectors have higher entropy than FSM's or
EIG's. Higher entropy means more uncertainty (i.e., we need
more bits to store the desired information). So, NetSimile's
feature vectors capture the nuances (i.e., uncertainty) in the
graphs better than FSM or EIG, which then leads to more
discriminative power when comparing graphs.

Figure 5(a) depicts results of NetSimile (Scaled Canberra
Distance) on all datasets described in Section III-A.
Figure 5(b) shows the maximum p-values on NetSimile's local-
feature distributions obtained by running the Mann-Whitney U
Test. The former (NetSimile Scaled Canberra Distance) is
more discriminative in teasing out graph similarity than the
latter (Mann-Whitney U Test). For instance, Figure 5(b) has
many black grids corresponding to a maximum p-value of
0. So, even though some pairwise comparisons are possible,
the Mann-Whitney U Test is not able to capture differences
between any two networks. For brevity, we have omitted
the average p-value results, which were comparable with the
maximum p-value results.

5 For brevity, we do not report results on Cosine Similarity and the
other similarity/distance measures, which we tried. Most results are highly
correlated [9].

6 It is possible to do pairwise normalization by the number of nodes, but
this is not general for any set of networks.

Fig. 2. Comparative results: Canberra Distance scores between the DBLP-C networks by NetSimile (1st col.), FSM (2nd col.), and EIG (3rd col.). Second
row is the scaled Canberra Distance in [0, 1]. The baseline methods do not have the discriminative power of NetSimile.

Fig. 3. Entropy of feature vectors generated by NetSimile, FSM, and EIG
on the DBLP-C co-authorship networks. NetSimile's feature vectors have
higher entropy than FSM's or EIG's, which implies that they are capturing
the nuances in the graphs better than FSM or EIG.

TABLE III

NetSimile with x-means classifies the networks in an intuitive
way.

D. Interpretability of Results 

To make sense of our results, we exploit the background 
knowledge about the networks used in our experiments. Amid 
the real networks, we have three sets of collaboration networks 
(DBLP-C, DBLP-Y and IMDb), one technological network 
(Oregon AS), and a word co-occurrence network (Query Log). 
In addition, we have different synthetic networks generated 
by various commonly used models. One would expect these 
networks to be "clustered" by their types. This idea was
inspired by the considerations found in [10], where a large set
of networks of different types are analyzed, together with their 
typical global and local features. For these experiments, we use 
two clustering algorithms: (1) agglomerative clustering [11] 
with Canberra Distance and unweighted average linking and 
(2) x-means clustering [12]. We chose the former since hi- 
erarchical clustering allows for easy interpretation of results. 
We chose the latter because it is a nonparametric version of k- 
means, where the number of clusters k is picked automatically 
through model selection. 
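
The first of the two clustering algorithms, agglomerative clustering with unweighted average linking, can be sketched in a few lines of pure Python. This is an illustrative quadratic-per-merge version that stops at a requested number of clusters; the function name and the stopping rule are choices made for this example, and a real experiment would use a library implementation that also returns the full dendrogram.

```python
def average_linkage(dist, num_clusters):
    """Merge clusters by smallest average pairwise distance until num_clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances,
    e.g. Canberra distances between graph signature vectors.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Unweighted average of all cross-cluster distances.
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Four hypothetical graphs: 0 and 1 are alike, 2 and 3 are alike.
toy = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
groups = average_linkage(toy, 2)
```

The average distance at which two clusters merge corresponds to the height at which their branches join in a dendrogram.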

Figure |4ja) presents the dendrogram of all of our networks 
built by hierarchical agglomerative clustering with unweighted 
average linking and the Canberra Distance and using Net- 
Simile's graph "signature" vectors. The network names are 
colored by data set. As evident in Figure |4ja), there is a 
clear distinction between the clusters. The collaboration net- 
works appear all together, along with the forest fire synthetic 
networks. The Oregon AS forms a cluster that only at the 
height of 0.45 joins with the Query Log. The Erdos-Renyi and 
Watts-Strogatz form a separate cluster. This, in turns, reflects 
our aforementioned intuition about following our background 
knowledge of the data. Similar results are obtained by applying 
the x-means clustering on the vectors of local features, for 



networks (see Figure |6ja)) and the scatterplot for the Net- 
Simile values and the ratio between the average clustering 
coefficients of the nodes of the two networks (see Figure [6jb)). 
As evident in these scatterplots, NetSimile's results are not 
merely reflecting the difference in sizes of the networks. If 
they were, we would expect to observe correlations among the 



which we report the outcome in Table III A part from the 
distribution of the random networks, the clusters reflect what 
we observe in Figure |4ja). 

Figure |4jb) shows the dendrogram for the above experi- 
ment (hierarchical agglomerative clustering with unweighted 
average linking and the Canberra Distance) for graph vectors 
generated by EIG. This figure clearly shows a different 
picture, where the networks are grouped differently (see how 
the distribution of the colors is mixed). For example, in the 
leftmost cluster, two collaboration networks from arXiv are 
put together with four Query Log networks, while the missing 
Query Log network is placed together with the Oregon AS 
networks. The EIG results are not intuitive, thus making EIG 
not suitable for interpreting graph- similarity results. 

E. Similarity of Networks with Different Sizes 

One question that may arise regarding Net Simile is 
whether its results are affected by the differences in sizes or 
other basic statistics of the two networks being compared. We 
do not want the size to play an important role in our solutions 
given that our interpretation of the question "are two networks 
similar?" leads to the question "do the two networks follow 
the same (or similar) underlying linking model?". 

To answer the aforementioned questions, we compared the 
relationships between the Net Simile with Canberra Distance 
and some basic statistics of our real and synthetic networks. 
Specifically, we compared Net Simile values of two networks 
with the ratio between their (1) number of nodes, (2) number 
of edges, (3) average clustering coefficients of the nodes, 
(4) average degree, (5) maximum degree, and (6) network 
clustering coefficient. In all of them, we saw no correlation. 
For brevity, we only show the scatterplot for the NetSimile 
values and the ratio between the number of nodes of the two 
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(b)Mann- Whitney U Test: 
Maximum p-values on Local-Feature Distributions 

Fig. 5. NetSimile outperforms the Mann- Whitney U test on local feature 
distributions by producing more discriminative results. Plots are heatmaps of 
scores of all pairs of networks. Grid lines indicate ground truth, marking 
groups of networks. The ideal methods should have high scores (white in (a), 
yellow in (b)) on the diagonal blocks. 
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4. Hierarchical dendrograms of all network based on (a) NetSimile with Canberra Distance, and (b) EIG with Canberra Distance. Network : 
colored by data set. Homogeneity in colors (NetSimile's dendrogram) indicates better and more intuitive groupings (than EIG's dendrogram). 
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Fig. 6. NetSimile Canberra Distance is not measuring size, as there is no 
clear evidence of correlation between the two axes. 



points in each scatterplot. This implies that we can generate 
two networks of the same kind, with different sizes (e.g., two 
Forest-Fire networks [5] of sizes 10K and 100K nodes) and 
NetSimile would find them similar. 
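
The size check described above can be sketched as follows. This is a minimal illustration, assuming the correlation is measured with Pearson's coefficient; the text only reports that no correlation was seen, not which statistic was used. The function names and the toy numbers are hypothetical and are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def stat_ratio(a, b):
    """Ratio of one basic statistic (e.g., node counts) for a pair of networks.

    Assumption: the ratio is oriented as larger/smaller, so it is always >= 1.
    """
    return max(a, b) / min(a, b)

# Hypothetical pairs: (nodes in first graph, nodes in second graph, distance).
# The values are made up for illustration only.
toy_pairs = [
    (1000, 1200, 0.31),
    (1000, 50000, 0.12),
    (5000, 5200, 0.45),
    (800, 90000, 0.38),
]
ratios = [stat_ratio(a, b) for a, b, _ in toy_pairs]
distances = [d for _, _, d in toy_pairs]
r = pearson(ratios, distances)
print(f"Pearson r between size ratio and distance: {r:.3f}")
```

A coefficient near zero over many network pairs would be consistent with the distance not tracking size; the same check can be repeated for each of the six statistics listed above.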

F. Scalability

Table IV reports the run times (in seconds) of NetSimile
and the two baselines (FSM and EIG) when applied to our
real networks. Note that for NetSimile the run times cover
all three steps described in Section II, with the comparison
step consisting of the pairwise computation of both the
Cosine Similarity and the Canberra Distance. For FSM, we
do not report run times longer than two days.

NetSimile and EIG are able to compare graphs in a
matter of seconds, though EIG produces results that are
size-dependent. FSM pays for its subgraph isomorphism, which
considerably degrades its performance. Note that FSM is
affected not only by the size of the network, but also by its
type. While DBLP is a set of collaboration networks (with
sparsely connected cliques), the Oregon AS graph (being a
technology network) is made of a single connected component,
so the cost of the isomorphism becomes much higher.
TABLE IV

RUN TIMES (IN SECONDS, UNLESS OTHERWISE NOTED) OF NETSIMILE, 
FSM, AND EIG ON OUR REAL NETWORKS 



G. Applications 

NetSimile can be used in numerous graph mining appli- 
cations. Here we discuss three of them. 



NetSimile as a Measure of Node-Overlap. Given three
graphs $G_A$, $G_B$, and $G_C$ of the same domain (e.g.,
co-authorship networks in SIGMOD, VLDB, and ICDE), can we
use only their NetSimile "signature" vectors to gauge the
amount of node-overlap between them? Our hypothesis is
that if graph $G_A$ is more similar to graph $G_B$ than to graph
$G_C$, then $G_A$ will have more overlap in terms of nodes with
$G_B$ than with $G_C$. To test this hypothesis, we ran NetSimile
with Canberra Distance on our real networks. Figure 7(a)
depicts the scatterplot of NetSimile results on graphs within
each comparable group (i.e., arXiv, DBLP-C, DBLP-Y, IMDb,
Query Log, and Oregon AS graphs). The y-axis is the normalized
node overlap and is equal to
$\frac{|V_{G_A} \cap V_{G_B}|}{\sqrt{|V_{G_A}| \times |V_{G_B}|}}$.
As the figure shows, the lower the NetSimile Canberra Distance,
the higher the normalized node intersection. This confirms our
hypothesis that NetSimile can be used to gauge node-overlap
between two graphs without node correspondence information.
Figure 7(b) shows the same scatterplot, but computed using
the EIG Canberra Distance approach. In this case, there is no
correlation between node overlap and the distance. Due to its
scalability issues, the FSM approach could not be computed
on all the networks in Figure 7.
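
For concreteness, the y-axis quantity can be sketched in a few lines. This sketch assumes the normalized node overlap is the size of the node intersection divided by the geometric mean of the two node-set sizes, which is our reading of the formula in the text; the function name and the toy author sets are hypothetical.

```python
import math

def normalized_node_overlap(nodes_a, nodes_b):
    """|V_A ∩ V_B| / sqrt(|V_A| * |V_B|), in [0, 1].

    Assumption: node identities (e.g., author names) are comparable
    across the two graphs, which is only needed to compute the ground
    truth on the y-axis, not to compute NetSimile itself.
    """
    set_a, set_b = set(nodes_a), set(nodes_b)
    if not set_a or not set_b:
        return 0.0  # convention chosen here for empty graphs
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

# Hypothetical co-authorship node sets.
venue_a = {"ann", "bob", "carol", "dave"}
venue_b = {"carol", "dave", "erin", "frank"}
overlap = normalized_node_overlap(venue_a, venue_b)
print(overlap)
```

The geometric-mean denominator keeps the value at 1 for identical node sets and at 0 for disjoint ones, regardless of how large either graph is.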

NetSimile as a Network Labeler. Given a new (never
before seen) graph, can we use the Canberra Distance between
its NetSimile "signature" vector and the NetSimile
"signature" vectors of known graphs to accurately predict its label?
To answer this question, we set up and ran the following 4-step
experiment. In step 1, we created a set of test graphs by
generating 50 synthetic graphs of types Erdős-Rényi,
Watts-Strogatz, Barabási, and Forest Fire. In step 2, we compared
each test graph with the existing graphs (as reported in
Table I) using the NetSimile normalized Canberra
Distance. In step 3, we assigned to the test graph the label of
its most similar graph. In step 4, we computed the accuracy
of our predictions.

The predictive accuracy of NetSimile was 100%, i.e.,
NetSimile was able to label all 50 test graphs accurately.
For each of the 50 test graphs, we inspected the NetSimile
normalized Canberra Distance to the most similar graph
(whose label we chose) and to the second most similar graph
(whose label we did not choose). Let us call the former $dist_1$
and the latter $dist_2$. The minimum difference between $dist_1$
and $dist_2$ across the 50 test graphs was 0.001. The maximum
was 0.428. The mean difference was 0.143, and the standard
deviation was 0.112. Thus, the answer to the aforementioned
question of whether NetSimile can be used effectively as a
network labeler is yes.
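
The labeling procedure amounts to a nearest-neighbor lookup over "signature" vectors. The sketch below is illustrative only: it assumes the normalization divides the Canberra Distance by the vector length, it takes the second-ranked graph overall as the "second most similar" one, and the signatures are made-up toy values rather than real NetSimile output.

```python
def canberra(u, v):
    """Canberra distance: sum of |u_i - v_i| / (|u_i| + |v_i|).

    Terms where both entries are zero are skipped (a common convention).
    """
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def normalized_canberra(u, v):
    """Assumption: divide by the vector length so the value lies in [0, 1]."""
    return canberra(u, v) / len(u)

def label_by_nearest(test_signature, known_graphs):
    """Return (predicted label, dist_2 - dist_1) for one test graph.

    known_graphs is a list of (label, signature) pairs with at least
    two entries; dist_1 and dist_2 are the two smallest distances.
    """
    ranked = sorted(
        (normalized_canberra(test_signature, sig), label)
        for label, sig in known_graphs
    )
    (dist_1, best_label), (dist_2, _) = ranked[0], ranked[1]
    return best_label, dist_2 - dist_1

# Hypothetical signatures (toy values, three dimensions only).
known = [
    ("Erdos-Renyi", [0.1, 0.5, 2.0]),
    ("Forest Fire", [0.4, 0.9, 6.0]),
    ("Watts-Strogatz", [0.2, 0.1, 3.0]),
]
label, margin = label_by_nearest([0.1, 0.5, 2.1], known)
print(label, round(margin, 3))
```

A small margin means the two best candidates were nearly tied, which is why the text reports the minimum, maximum, mean, and standard deviation of this difference.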

We ran the same experiments using EIG with Canberra
Distance on the same networks. The predictive accuracy
of EIG was 72%, i.e., 14 graphs were incorrectly labeled.
Figure 8 shows the distribution of the ranking for the correct
labels of graphs. There are two cases in which the correct
(i.e., true) labels for the graphs are ranked 11th by EIG. Due
to scalability issues, FSM could not be performed on all the
networks in this experiment.
Fig. 7. (a) NetSimile Canberra Distance on DBLP, IMDb, Oregon and
QueryLog. (b) EIG Canberra Distance on the same networks. NetSimile 
is an effective measure for node overlap without any node-correspondence 
information. The lower the NetSimile Canberra Distance, the higher the 
normalized node intersection. This correlation does not hold for EIG. The 
points in both plots lie along the fitted lines. For NetSimile (a), the root
mean square of the residuals is 6.5E-2 for DBLP-C, 2.6E-2 for DBLP-Y,
9.0E-3 for IMDb, 1.4E-2 for Oregon AS, and 6.5E-2 for Query Log. For
EIG (b), the root mean square of the residuals is 8.2E-2 for DBLP-C, 4.2E-2
for DBLP-Y, 1.3E-3 for IMDb, 1.2E-2 for Oregon AS, and 6.7E-2 for
Query Log.



NetSimile as a Discontinuity Detector. Given a time
series of graphs $\{G_1, G_2, G_3, \ldots, G_t\}$, can NetSimile detect
any discontinuities (i.e., temporal outliers) present in the data?
To answer this question, we use the NetSimile Canberra
Distance to compute the difference between graphs in a time
series. For this experiment, we used data from two
different messaging services, Yahoo! IM and Twitter.

The first dataset contains 28 days of Yahoo! IM communications
(http://sandbox.yahoo.com), starting Tuesday,
April 1, 2008. Each graph is a collection of instant messages
Fig. 8. Distribution of the rankings of the correct labels for the synthetic
graphs by EIG with Canberra Distance. EIG's accuracy is only 72%. There
are two graphs whose correct labels are ranked 11th.



(IMs) per day, with nodes representing IM users and links
denoting communication events. The graphs are of varying
sizes: the number of nodes ranges from 29K to 100K and the
number of edges from 80K to 280K. We computed the NetSimile
normalized Canberra Distance between Day 0 (April 1, 2008)
and the other 27 days. Figure 9(a) shows our results, with
the x-axis representing days and the y-axis representing
NetSimile (with normalized Canberra Distance) between Day 0
and the other 27 days. Figure 9(b) shows NetSimile (with
normalized Canberra Distance) between Day 8 (April 9, 2008)
and all the other days. As the figures illustrate, NetSimile
detects the weekday vs. weekend discontinuities. It also detects
a discontinuity on Wednesday, April 9, 2008. The following
event explains this discontinuity. Flickr announced that it would
add video to its popular photo-sharing community on April
8, 2008, but the news spread on April 9, 2008. This event
is reflected in the graph for April 9, 2008, where the number
of connected components decreases by a factor of 4 as the news
about Flickr spreads among the IM users.
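
The time-series comparison can be sketched as follows. The distance-from-a-reference-day step follows the text; the flagging rule (a distance more than three times the median distance) is an assumption added for illustration, since the paper identifies the discontinuities by inspecting the plots. The daily signatures are toy values.

```python
import statistics

def canberra(u, v):
    """Canberra distance, skipping terms where both entries are zero."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def distances_from_reference(signatures, ref_day):
    """Distance from the reference day's signature to every other day's."""
    ref = signatures[ref_day]
    return {
        day: canberra(ref, sig)
        for day, sig in enumerate(signatures)
        if day != ref_day
    }

def flag_discontinuities(distances, factor=3.0):
    """Hypothetical rule: flag days whose distance exceeds factor * median."""
    threshold = factor * statistics.median(distances.values())
    return sorted(day for day, d in distances.items() if d > threshold)

# Hypothetical daily signatures (two dimensions only); day 4 is the outlier.
daily_signatures = [
    [1.0, 10.0],
    [1.1, 10.5],
    [0.9, 9.5],
    [1.0, 10.2],
    [3.0, 30.0],
    [1.05, 9.8],
]
dists = distances_from_reference(daily_signatures, ref_day=0)
flagged = flag_discontinuities(dists)
print(flagged)
```

Repeating the computation with the flagged day as the reference, as in the second panel of the figure, checks that the day is far from all the others rather than only from Day 0.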

The second dataset contains 30 days of Twitter @replies
(i.e., messages that begin with a direct mention of "@user"),
starting Monday, June 1, 2009. The sizes of these graphs vary
less than in the Yahoo! data: the number of nodes ranges from 4K
to 8K, and the number of edges ranges from 2K to 5K. Similar
to the Yahoo! experiments (detailed above), we computed
the NetSimile normalized Canberra Distance between Day 0
and the other 30 days. Figure 10(a) shows our results,
with the x-axis representing days and the y-axis representing
NetSimile (with normalized Canberra Distance) between
Day 0 and the other 30 days. Figure 10(b) shows NetSimile
(with normalized Canberra Distance) between Day 6 (June 7,
2009) and all the other days. Figure 10 shows no particular
Fig. 9. NetSimile detects discontinuities in time-evolving graphs. (a)
Distance of the day 1, day 2, ..., day 27 IM graphs from the day 0 (Tuesday,
April 1, 2008) IM graph. Weekdays are distinguished from weekends (yellow
line = Saturday, blue line = Sunday). The peak on the 2nd Wednesday (April 9,
2008) corresponds to a big Flickr announcement and a Microsoft offer to
buy Yahoo!. (b) Distance of the other days from the day 8 (April 9, 2008) IM
graph. All the other days are distant from April 9, 2008.



periodicity. This is not surprising since Twitter @replies
are semi-private conversations between two people and their
common followers (unlike instant messages, which are private).
The @replies have less of the "news amplification"
effect than regular tweets [13]. However, NetSimile is still able
to spot a significant discontinuity on day 6 (June 7, 2009).
On that day, the largest cross-country election in history
took place (namely, the European parliamentary elections),
affecting almost 500M people; parliamentary elections also
took place in Lebanon; high school students graduated in the
U.S.; and Roger Federer became the sixth man in tennis history
to complete a career Grand Slam (by winning the French Open
on that day) and tied Pete Sampras' Grand Slam record.

IV. Related Work 

Fig. 10. NetSimile detects discontinuities in time-evolving graphs. (a)
Distance of the day 1, day 2, ..., day 30 Twitter graphs from the day 0
(Monday, June 1, 2009) graph. Weekdays are distinguished from weekends
(yellow line = Saturday, blue line = Sunday). (b) Distance of the other days
from the day 6 (June 7, 2009) Twitter graph. All the other days are distant
from June 7, 2009. Twitter @replies are semi-private conversations; thus, we
do not expect to see periodicity (like in IM conversations). On June 7, 2009,
elections were held for the European Parliament and for the Lebanese
Parliament, Roger Federer won the French Open and made tennis history, and
high-schoolers graduated in the U.S.

Assessing the similarity between two "objects" comes up
in numerous settings. Thus, the literature is rich in similarity
measures for various domains: distributions or multi-dimensional
points [9], datacubes [14], and graphs, such as
social ([15], [16]), information [17], and biological networks
[18]. Here, we focus on graph similarity when the node
correspondence is unknown. The proposed methods can be
divided into three main classes.

(1) Graph isomorphism. The similarity of the graphs
depends on whether the graphs are isomorphic to each other
[19]; or one graph is a subgraph of the other ([20], [21]); or
they have common subgraphs [22]. The exact algorithms for
these problems are exponential, rendering them inapplicable
to the large graphs on which our research focuses. Graph
edit distance is a generalization of the graph isomorphism
problem, which consists of finding the minimum number of
operations (insertions, deletions, renaming of nodes, reversal
of edges) required to convert one graph into another.
Different cost functions ([23], [24]) have been proposed in the
literature. It is worth noting that graph isomorphism
addresses mainly the graph matching problem (roughly, given
two graphs, find the correspondence between their nodes). It
does not directly address the graph similarity problem, which
is the topic of this paper.

(2) Iterative methods. Two well known representatives of 
this category are SimRank [25] and similarity flooding [26]. 
The former computes all the pairwise similarities between the 
nodes. The latter attempts to find the correspondences between 
the nodes of the graphs in order to assess their similarity. Zager 
et al. [27] proposed a method that computes the similarity of 
the graphs by coupling the pairwise similarities between the 
nodes with the similarities between the corresponding edges. 

(3) Feature Extraction. These methods, which are based 
on comparing specific graph features, are popular due to their 
scalability. The challenge here is the appropriate selection 
of features/patterns, since some features, such as frequent 
subgraphs ([8], [28], [18], [29], [30]), can be computationally 
expensive to extract. Macindoe et al. [16] and Faust [15] focus 
on comparing social networks by extracting socially relevant 
features. GraphGrep [31] extracts paths by doing random 
walks on the graphs. GString [32] converts the graphs into 
sequences and extracts features from the latter. G-Hash [33] 
applies a wavelet matching kernel to indirectly extract infor- 
mation about the neighborhoods of the nodes. Papadimitriou 
et al. [17] compare web graphs, where the correspondence of 
the nodes is known. Henderson et al. [1] propose a method 
for mining recursive structural features. NetSimile is easily 
extensible to incorporate these features. Lastly, Li et al. [34]
propose a classification approach for attributed graphs, which
is based on global feature extraction. The weakness of their
method is that some features (e.g., eccentricity and
shortest paths) are computationally expensive, and thus it does
not scale to large graphs. Moreover, the method is domain-specific
and focuses on databases of graphs, such as chemical
compounds, while our work aims at comparing graphs from
different domains.

In this work, we focus on the local feature extraction 
approach. We aggregate a number of carefully chosen, in- 
terpretable, intuitive, and computationally inexpensive local 
features that capture the nuances in the structural information, 
and then employ various techniques (similarity measures [9], 
hierarchical clustering, and hypothesis testing) in order to find 
the pairwise similarity scores of the given (possibly, cross- 
domain) networks. 

V. Conclusions 

We introduced NetSimile, a novel, effective, size-independent,
and scalable method for comparing large networks.
NetSimile has three components: (1) feature extraction,
(2) feature aggregation, and (3) comparison. The heart
of our contribution is in components (1) and (2), where we
discovered that moments of distributions of structural features
computed on the nodes and their egonets provide an excellent
"signature" vector for a graph. These "signature" vectors can
be used to effectively and quickly assess the similarity of two
or more graphs.
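
As a rough illustration of components (1) and (2), the sketch below builds a fixed-length "signature" from per-node features. It is not the paper's implementation: it assumes only two node features (degree and clustering coefficient) and five aggregators (median, mean, standard deviation, skewness, kurtosis) chosen here for illustration, and it uses a simple quadratic-per-node clustering computation rather than an optimized one.

```python
import statistics

def node_features(adj):
    """Per-node (degree, clustering coefficient) from an adjacency dict.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    feats = []
    for node, nbrs in adj.items():
        deg = len(nbrs)
        # Count edges among the node's neighbors, each pair once.
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        possible = deg * (deg - 1) / 2
        feats.append((deg, links / possible if possible else 0.0))
    return feats

def moments(values):
    """Median, mean, standard deviation, skewness, and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    if std == 0:
        skew = kurt = 0.0  # convention chosen here for constant features
    else:
        skew = sum(((x - mean) / std) ** 3 for x in values) / n
        kurt = sum(((x - mean) / std) ** 4 for x in values) / n
    return [statistics.median(values), mean, std, skew, kurt]

def signature(adj):
    """Concatenate the aggregated moments of every feature column."""
    sig = []
    for column in zip(*node_features(adj)):
        sig.extend(moments(list(column)))
    return sig

# Toy graphs of different sizes: a triangle with a pendant node, and a path.
small = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
sig_small, sig_path = signature(small), signature(path)
print(len(sig_small), len(sig_path))
```

Because every graph is reduced to the same number of aggregated values, two signatures can be compared directly even when the graphs differ in size and share no nodes.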

Our broader contributions are: 

• Novelty: NetSimile avoids the (expensive) node correspondence
problem and adjusts for graph size.

• Effectiveness: NetSimile gives results that agree with
intuition and the ground truth.

• Scalability: NetSimile generates its "signature" vectors
in time linear in the input size (i.e., the number of edges of
the input graphs).

• Applicability: NetSimile's "signature" vectors are useful
in numerous graph mining tasks. In addition, NetSimile
is easily extensible to include features and aggregators
besides the ones presented.